Multimodal learning
General-purpose neural networks capable of handling diverse input modalities and output tasks
See:
Resources
- Multimodal Deep Learning
- https://paperswithcode.com/methods/category/vision-and-language-pre-trained-models
- Vision Language models: towards multi-modal deep learning
Code
- #CODE Pykale - Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research
- #CODE Unilm - Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Courses
Books
- #BOOK Multimodal Deep Learning (Akkus 2023)
- https://slds-lmu.github.io/seminar_multimodal_dl/index.html
References
- #PAPER Multi-modal Transformer for Video Retrieval (Gabeur 2020)
- #PAPER #REVIEW Recent Advances and Trends in Multimodal Deep Learning: A Review (Summaira 2021)
- #PAPER Perceiver: General Perception with Iterative Attention (Jaegle 2021)
- https://www.zdnet.com/article/googles-supermodel-deepmind-perceiver-is-a-step-on-the-road-to-an-ai-machine-that-could-process-everything/
	- Multi-modal model handling images, audio, video, and 3D point clouds
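The core idea of the Perceiver can be sketched in a few lines: a small learned latent array queries the (possibly huge) flattened input via cross-attention, repeatedly, so cost scales with latent size times input size rather than quadratically in the input. This is a minimal numpy sketch of that bottleneck only (no learned projections, heads, or MLP blocks, all of which the real model has); array sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, inputs):
    # latents: (N, D) small learned array; inputs: (M, D) flattened modality bytes.
    # Queries come from the latents, keys/values from the inputs, so the score
    # matrix is (N, M) instead of the (M, M) of full self-attention.
    d = latents.shape[-1]
    scores = latents @ inputs.T / np.sqrt(d)  # (N, M)
    return softmax(scores) @ inputs           # (N, D)

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 16))    # latent bottleneck, N=8 (illustrative)
inputs = rng.normal(size=(1000, 16))  # e.g. flattened image/audio array, M=1000

# "iterative attention": refine the latents against the same inputs several times
for _ in range(3):
    latents = cross_attention(latents, inputs)

print(latents.shape)  # (8, 16)
```

Because the output shape is fixed by the latent array, the same loop works unchanged whether `inputs` holds image pixels, audio samples, or point-cloud features.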
- #PAPER PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python (Lu 2021)
- #PAPER Perceiver IO: A General Architecture for Structured Inputs & Outputs (Jaegle 2021)
- #PAPER VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text (Akbari 2021)
- #CODE https://paperswithcode.com/paper/vatt-transformers-for-multimodal-self
- VATT is trained to learn multimodal representations from unlabeled data using Transformer architectures
- #PAPER NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion (Wu 2021)
- #CODE https://paperswithcode.com/paper/nuwa-visual-synthesis-pre-training-for-neural
- Paper explained
	- NÜWA consists of an adaptive encoder that takes either text or visual input, and a pre-trained decoder shared by 8 visual tasks
	- A 3D Nearby Attention mechanism (3DNA) is proposed to reduce computational complexity and improve visual quality by exploiting the locality of visual data along both the spatial and temporal axes
- #PAPER data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language (Baevski 2022)
- #PAPER A Generalist Agent (Reed 2022)
- Paper explained
	- A new approach, inspired by large-scale language models, that acts as a single generalist agent. The agent, called Gato, is built to work as a multi-modal, multi-task, multi-embodiment generalist policy
- #PAPER Towards artificial general intelligence via a multimodal foundation model (Fei 2022)
- #PAPER Language Models are General-Purpose Interfaces (Hao 2022)
- #PAPER NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis (Wu 2022)